Abstract Hand motion capture is a popular research field, recently gaining more attention due to the ubiquity of RGB-D sensors. However, even most recent approaches focus on the case of a single isolated hand. In this work, we focus on hands that interact with other hands or objects and present a framework that successfully captures motion in such interaction scenarios for both rigid and articulated objects. Our framework combines a generative model with discriminatively trained salient points to achieve a low tracking error and with collision detection and physics simulation to achieve physically plausible estimates even in case of occlusions and missing visual data. Since all components are unified in a single objective function which is almost everywhere differentiable, it can be optimized with standard optimization techniques. Our approach works for monocular RGB-D sequences as well as setups with multiple synchronized RGB cameras. For a qualitative and quantitative evaluation, we captured 29 sequences with a large variety of interactions and up to 150 degrees of freedom.

Keywords Hand motion capture • Hand-object 
interaction • Fingertip detection • Physics simulation 


D. Tzionas • A. Srikantha • P. Aponte • J. Gall
Institute of Computer Science III, University of Bonn
Römerstraße 164, 53117 Bonn, Germany
E-mail: {tzionas,srikanth,aponte,gall}@iai.uni-bonn.de

D. Tzionas • A. Srikantha
Perceiving Systems Department
Max Planck Institute for Intelligent Systems
Spemannstraße 41, 72076 Tübingen, Germany

L. Ballan • M. Pollefeys
Institute for Visual Computing, ETH Zurich
Universitätstraße 6, CH-8092 Zurich, Switzerland
E-mail: {luca.ballan,marc.pollefeys}@inf.ethz.ch


1 Introduction 


Capturing 3D motion of human hands has been an important research topic in computer vision for decades (Erol et al 2007; Heap and Hogg 1996) due to its importance for numerous applications including, but not limited to, computer graphics, animation, human computer interaction, rehabilitation and robotics. With recent technology advancements of consumer RGB-D sensors, the research interest in this topic has increased in the last few years (Tompson et al 2014; Ye et al 2013). Despite being a special instance of full human body tracking, it cannot be easily solved by applying known techniques for human pose estimation like (Shotton et al 2011) to human hands. While hands share some challenges with the full body like the high dimensionality of the underlying skeleton, they introduce additional difficulties. The body parts of the hands are very similar in shape and appearance, palm and forearm are difficult to model, and severe self-occlusions are a frequent phenomenon.

Due to these difficulties, the research from the first efforts in the field (Heap and Hogg 1996) even until very recent approaches (Tompson et al 2014) has mainly focused on a single isolated hand. While isolated hands are useful for a few applications like gesture control, humans use hands mainly for interacting with the environment and manipulating objects. In this work, we focus therefore on hands in action, i.e. hands that interact with other hands or objects. This problem has been addressed so far only by a few works (Hamer et al 2009, 2010; Kyriazis and Argyros 2013, 2014; Oikonomidis et al 2011b, 2012, 2014). While Hamer et al (2009) considered objects only as occluders, Hamer et al (2010) derive a pose prior from the manipulated objects to support the hand tracking. This approach,
however, assumes that training data is available to learn the prior. A different approach to model interactions between objects and hands is based on a collision or physical model. Within a particle swarm optimization (PSO) framework, Oikonomidis et al (2011b, 2012) approximate the hand by spheres to detect and avoid collisions. In the same framework, Kyriazis and Argyros (2013, 2014) enrich the set of particles by using a physical simulation for hypothesizing the state of one or several rigid objects.

Instead of employing a sampling based optimization approach like PSO, we propose in this work a single objective function that combines data terms, which align the model with the observed data, with a collision and physical model. The advantage of our objective function is that it can be optimized with standard optimization techniques. In our experiments, we use local optimization and enrich the objective function with discriminatively learned salient points to avoid pose estimation errors due to local minima. Salient points, like finger tips, have been used in the earlier work of Rehg and Kanade (1994). Differently from their scenario, however, these salient points cannot be tracked continuously due to the huge amount of occlusions and the similarity in appearance of these features. Therefore we cannot rely on having a fixed association between the salient points and the respective fingers. To cope with this, we propose a novel approach that solves the salient point association jointly with the hand pose estimation problem.

Preliminary versions of this paper appeared in (Ballan et al 2012; Tzionas et al 2014). The present work unifies the pose estimation for multiple synchronized RGB cameras (Ballan et al 2012) and a monocular RGB-D camera (Tzionas et al 2014). In addition, the objective function is extended by a physical model that increases the realism and physical plausibility of the hand poses. In the experiments, we qualitatively and quantitatively evaluate our approach on 29 sequences and present for the first time successful tracking results of two hands strongly interacting with non-rigid objects.

2 Related Work

The study of hand motion tracking has its roots in the 90s (Rehg and Kanade 1995, 1994). Although the problem can be simplified by means of data-gloves (Ekvall and Kragic 2005), color-gloves (Wang and Popović 2009), markers (Vaezi and Nekouie 2011) or wearable sensors (Kim et al 2012), the ideal solution pursued is the unintrusive, marker-less capture of hand motion.

In pursuit of this, one of the first hand tracking approaches (Rehg and Kanade 1994) introduced the use of local optimization in the field. Several filtering approaches have been presented (Bray et al 2007; MacCormick and Isard 2000; Stenger et al 2001; Wu et al 2001), while also belief-propagation proved to be suitable for articulated objects (Hamer et al 2009, 2010; Sudderth et al 2004). Oikonomidis et al (2011a) employ Particle Swarm Optimization (PSO) as a form of stochastic search, while later they present a novel evolutionary algorithm that capitalizes on quasi-random sampling (Oikonomidis et al 2014). Kim et al (2012) and Wang and Popović (2009) use inverse-kinematics, while Heap and Hogg (1996) and Wu et al (2001) reduce the search space using linear subspaces. Athitsos and Sclaroff (2003) resort to probabilistic line matching, while Thayananthan et al (2003) combine Bayesian filtering with Chamfer matching. Recently, Schmidt et al (2014) extended the popular signed distance function (SDF) representation to articulated objects, while Qian et al (2014) combine a gradient based ICP approach with PSO, showing the complementary nature of the two approaches. Sridhar et al (2013) explore the use of a Sum of Gaussians (SoG) model for hand tracking on RGB images, which is later replaced by a Sum of Anisotropic Gaussians (Sridhar et al 2014).

All these approaches have in common that they are generative models. They use an explicit model to generate pose hypotheses, which are evaluated against the observed data. The evaluation is based on an objective function which implicitly measures the likelihood by computing the discrepancy between the pose estimate (hypothesis) and the observed data in terms of an error metric. To keep the problem tractable, each iteration is initialized by the pose estimate of the previous step, relying thus heavily on temporal continuity and being prone to accumulative error. The objective function is evaluated in the high-dimensional, continuous parameter space. Recent approaches relax the assumption of a fixed predefined shape model, allowing for online non-rigid shape deformation (Taylor et al 2014) that enables better data fitting and user-specific adaptation.

Discriminative methods learn a direct mapping from the observed image features to the discrete (Athitsos and Sclaroff 2003; Romero et al 2009, 2010) or continuous (de Campos and Murray 2006; Rosales et al 2001) target parameter space. Some approaches also segment the body parts first and estimate the pose in a second step (Keskin et al 2012; Tompson et al 2014). Most methods operate on a single frame, being thus immune to pose-drifting due to error accumulation. Generalization in terms of capturing illumination, articulation and view-point variation can be realized only through adequate representative training data. Acquisition and annotation of realistic training data is though a cumbersome and costly procedure. For this reason most approaches rely on synthetic rendered data (Keskin et al 2012; Romero et al 2010) that has inherent ground-truth. Special care is needed to avoid over-fitting to the training set, while the discrepancy between realistic and synthetic data is an important limiting factor. Recent approaches (Tang et al 2013) tried to address the latter using transductive regression forests to transfer knowledge from fully labeled synthetic data to partially labeled realistic data. Finally, the accuracy of discriminative methods heavily depends on the invariance, repeatability and discriminative properties of the features employed and is lower in comparison to generative methods.

Fig. 1 Qualitative results of our approach for the case of hand-hand interaction. Each pair shows the aligned RGB and depth input images after depth thresholding along with the pose estimate

A discriminative method can effectively complement a generative method, either in terms of initialization or recovery, driving the optimization framework away from local minima in the search space and aiding convergence to the global minimum. Sridhar et al (2013) combine in a real time system a Sum of Gaussians (SoG) generative model with a discriminatively trained fingertip detector in depth images using a linear SVM classifier. Alternatively, the model can also be combined with a part classifier based on random forests (Sridhar et al 2015). Recently, Sharp et al (2015) combined a PSO optimizer with a robust, two-stage regression reinitializer that predicts a distribution over hand poses from a single RGB-D frame.

Generative and discriminative methods have used various low level image cues for hand tracking that are often combined, namely silhouettes, edges, shading, color, optical flow (de La Gorce et al 2011; Lu et al 2003), while depth (Bray et al 2007; Delamarre and Faugeras 2001; Hamer et al 2009) has recently gained popularity with the ubiquity of RGB-D sensors (Oikonomidis et al 2011a, 2012; Qian et al 2014; Schmidt et al 2014; Sridhar et al 2013; Tompson et al 2014). In this work, we combine in a single framework a generative model with discriminative salient points detected by a Hough forest (Gall et al 2011b), i.e. a finger nail detector on color images and finger tip detector on depth images, respectively. As low level cues, we use edges and optical flow for the RGB sequences and depth for the RGB-D sequences.


3 Pose Estimation 

Our approach for capturing the motion of hands and manipulated objects can be applied to RGB-D and multi-view RGB sequences. In both cases hands and objects are modeled in the same way, as described in Section 3.1. The main difference between RGB-D and RGB sequences is the data term used, which depends on depth or on edges and optical flow, respectively. We therefore first introduce the objective function for a monocular RGB-D sequence in Section 3.2 and describe the differences for RGB sequences in a subsequent section.


3.1 Hand and Object Models 


We resort to the popular linear blend skinning (LBS) model (Lewis et al 2000), consisting of a triangular mesh with an underlying kinematic skeleton, as depicted in Figure 2a-c, and a set of skinning weights. In our experiments, a triangular mesh of a pair of hands was obtained by a 3D scanning solution, while meshes for several objects (ball, cube, pipe, rope) were created manually with a 3D graphics software. Some objects are shown in Figure 3. A skeletal structure defining the kinematic chain was manually defined and fitted into the meshes. The skinning weight $\alpha_{v,j}$ defines the influence of bone $j$ on 3D vertex $v$, where $\sum_j \alpha_{v,j} = 1$. Figure 4 visualizes the mesh using the largest skinning weight for each vertex as bone association. The deformation of each mesh is driven by its underlying skeleton with pose parameter vector $\theta$ through the skinning weights and is expressed by the LBS operator:

$$v(\theta) = \sum_j \alpha_{v,j}\, T_j(\theta)\, T_j(0)^{-1}\, v(0) \tag{1}$$

where $T_j(0)$ and $v(0)$ are the bone transformations and vertex positions at the known rigging pose. The skinning weights are computed using (Baran and Popović 2007).
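The LBS operator of Equation (1) can be illustrated with a minimal sketch in plain Python. The function names, the dictionary-based data layout and the two toy bones ("palm", "index") are assumptions made for this illustration and are not part of the described system; the relative transforms $T_j(\theta) T_j(0)^{-1}$ are assumed to be precomputed.

```python
def apply_transform(T, p):
    """Apply a 4x4 homogeneous transform T (nested lists) to a 3D point p."""
    return tuple(
        T[r][0] * p[0] + T[r][1] * p[1] + T[r][2] * p[2] + T[r][3]
        for r in range(3)
    )


def skin_vertex(v_rest, weights, rel_transforms):
    """Linear blend skinning of a single vertex.

    v_rest         -- vertex position v(0) in the rigging pose
    weights        -- {bone j: alpha_vj}, expected to sum to 1
    rel_transforms -- {bone j: T_j(theta) * T_j(0)^-1} as 4x4 nested lists
    """
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("skinning weights must sum to 1")
    blended = [0.0, 0.0, 0.0]
    for bone, alpha in weights.items():
        moved = apply_transform(rel_transforms[bone], v_rest)
        for k in range(3):
            blended[k] += alpha * moved[k]
    return tuple(blended)


# Hypothetical toy setup: one static bone and one bone shifted by 2 along z.
IDENTITY = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
SHIFT_Z = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 2], [0, 0, 0, 1]]

v = skin_vertex((1.0, 0.0, 0.0),
                {"palm": 0.5, "index": 0.5},
                {"palm": IDENTITY, "index": SHIFT_Z})
```

With equal weights the vertex ends up halfway between the positions dictated by the two bones, which is the blending behaviour the skinning weights are meant to provide.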
Fig. 2 Hand model used for tracking. (a) Mesh (b) Kinematic Skeleton (c) Degrees of Freedom (DoF) (d-f) Mesh fingertips (green) used for the salient point detector. The vertices of the fingertips are found based on the manually annotated red vertices. The centroid of the fingertips, as defined in Section 3.2.5, is depicted with yellow color



Fig. 4 Segmentation of the meshes based on the skinning 
weights. The ball and the cube are rigid objects while the 
pipe and rope are modeled as articulated objects. Each hand 
has 20 skinning bones, the pipe has 2, while the rope has 36 


Fig. 3 Object models used for tracking and their DoF: (top-left) a rigid ball with 6 DoF; (top-right) a rigid cube with 6 DoF; (middle) a pipe with 1 revolute joint, i.e. 7 DoF; (bottom) a rope with 70 revolute joints, i.e. 76 DoF

The global rigid motion is represented by a 6 DoF twist $\theta\xi = \theta(u_1, u_2, u_3, \omega_1, \omega_2, \omega_3)$ with $\|\omega\| = 1$ (Bregler et al 2004; Murray et al 1994; Pons-Moll and Rosenhahn 2011). The twist action $\theta\hat{\xi} \in se(3)$ has the form of a $4 \times 4$ matrix

$$\theta\hat{\xi} = \theta \begin{pmatrix} 0 & -\omega_3 & \omega_2 & u_1 \\ \omega_3 & 0 & -\omega_1 & u_2 \\ -\omega_2 & \omega_1 & 0 & u_3 \\ 0 & 0 & 0 & 0 \end{pmatrix} \tag{2}$$

and the exponential map operator $\exp(\theta\hat{\xi})$ defines the group action:

$$T(\theta\hat{\xi}) = \begin{pmatrix} R_{3 \times 3} & t_{3 \times 1} \\ 0_{1 \times 3} & 1 \end{pmatrix} = \exp(\theta\hat{\xi}) \in SE(3). \tag{3}$$


While $\theta = \theta\xi$ for a rigid object, articulated objects have additional parameters. We model the joints by revolute joints. A joint with one DoF is modeled by a single revolute joint, i.e. the transformation of the corresponding bone $j$ is given by $T_{p(j)}(\theta) \exp(\theta_j \hat{\xi}_j)$, where $p(j)$ denotes the parent bone. If a bone does not have a parent bone, $T_{p(j)}$ is the global rigid transformation. The transformation of an object with one revolute joint is thus described by $\theta = (\theta\xi, \theta_1)$. Joints with two or three DoF are modeled by a combination of $K_j$ revolute joints, i.e. $\prod_{k=1}^{K_j} \exp(\theta_{j,k} \hat{\xi}_{j,k})$. For simplicity, we denote the relative transformation of a bone $j$ by $\hat{T}_j(\theta) = \prod_{k=1}^{K_j} \exp(\theta_{j,k} \hat{\xi}_{j,k})$. The global transformation of a bone $j$ is then recursively defined by

$$T_j(\theta) = T_{p(j)}(\theta)\, \hat{T}_j(\theta). \tag{4}$$
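The exponential map of Equation (3) and the recursion of Equation (4) can be sketched for a single revolute joint as follows. This is an illustrative sketch in plain Python, not the authors' implementation: the helper names are hypothetical, the closed form uses the standard Rodrigues formula for a unit rotation axis, and the twist of a revolute joint is built from an assumed point on the joint axis.

```python
import math


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def exp_twist(u, w, theta):
    """4x4 matrix exp(theta * xi_hat) for a twist xi = (u, w) with ||w|| = 1."""
    c, s = math.cos(theta), math.sin(theta)
    w_hat = [[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]]
    # Rodrigues formula: R = cos*I + sin*w_hat + (1 - cos) * w w^T
    R = [[(c if i == j else 0.0) + s * w_hat[i][j] + (1.0 - c) * w[i] * w[j]
          for j in range(3)] for i in range(3)]
    wxu = cross(w, u)
    w_dot_u = sum(w[i] * u[i] for i in range(3))
    # Translation part: (I - R)(w x u) + w (w . u) theta
    t = [wxu[i] - sum(R[i][j] * wxu[j] for j in range(3)) + w[i] * w_dot_u * theta
         for i in range(3)]
    return [R[0] + [t[0]], R[1] + [t[1]], R[2] + [t[2]], [0.0, 0.0, 0.0, 1.0]]


def revolute_twist(point_on_axis, axis):
    """Twist (u, w) of a revolute joint: u = -w x q for a point q on the axis."""
    return cross(point_on_axis, axis), axis  # q x w equals -(w x q)


def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def apply_transform(T, p):
    return tuple(T[r][0] * p[0] + T[r][1] * p[1] + T[r][2] * p[2] + T[r][3]
                 for r in range(3))


# Hypothetical example: the parent (global) transform shifts by 5 along z, the
# joint rotates by 90 degrees about a z-parallel axis through (1, 0, 0).
T_parent = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 5.0], [0.0, 0.0, 0.0, 1.0]]
u, w = revolute_twist((1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
T_rel = exp_twist(u, w, math.pi / 2)   # relative transformation of the bone
T_bone = mat_mul(T_parent, T_rel)      # Equation (4)
p = apply_transform(T_bone, (2.0, 0.0, 0.0))
```

A point one unit away from the joint axis is rotated around that axis and then moved with the parent, ending up near (1, 1, 5). Joints with two or three DoF would chain several such exponentials before multiplying with the parent transform.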


In our experiments, a single hand consists of 31 revolute joints, i.e. 37 DoF, as shown in Figure 2c. The rigid objects have 6 DoF. The deformations of the non-rigid shapes shown in Figure 3 are approximated by a skeleton. The pipe has 1 revolute joint, i.e. 7 DoF, while the rope has 70 revolute joints, i.e. 76 DoF. Thus, for sequences with two interacting hands we have to estimate all 74 DoF and together with the rope 150 DoF.
3.2 Objective Function


Our objective function for pose estimation consists of seven terms:

$$E(\theta, D) = E_{model \to data}(\theta, D) + E_{data \to model}(\theta, D) + \gamma_c E_{collision}(\theta) + E_{salient}(\theta, D) + \gamma_{ph} E_{physics}(\theta) + \gamma_a E_{anatomy}(\theta) + \gamma_r E_{regularization}(\theta) \tag{5}$$

where $\theta$ are the pose parameters of the template meshes and $D$ is the current preprocessed depth image. The preprocessing is explained in Section 3.2.1. The first two terms minimize the alignment error of the transformed mesh and the depth data. The alignment error is measured by $E_{model \to data}$, which measures how well the model fits the observed depth data, and $E_{data \to model}$, which measures how well the depth data is explained by the model. $E_{salient}$ measures the consistency of the generative model with detected salient points in the image. The main purpose of the term in our framework is to recover from tracking errors of the generative model. $E_{collision}$ penalizes intersections of fingers and $E_{physics}$ enhances the realism of grasping poses during interaction with objects. Both of the terms $E_{collision}$ and $E_{physics}$ ensure physically plausible poses and are complementary. The term $E_{anatomy}$ enforces anatomically inspired joint limits, while the last term is a simple regularization term that prefers the solution of the previous frame if there are insufficient observations to determine the pose.

In the following, we give details for the terms of the objective function (5) as well as its optimization.


3.2.1 Preprocessing: 


For pose estimation, we first remove irrelevant parts of the RGB-D image by thresholding the depth values, in order to avoid unnecessary processing like normal computation for points far away. Segmentation of the hand from the arm is not necessary and is therefore not performed. Subsequently we apply skin color segmentation on the RGB image (Jones and Rehg 2002). As a result we get masked RGB-D images, denoted as $D$ in (5). The skin color segmentation separates hands and non-skin colored objects, facilitating hand and object tracking accordingly.


3.2.2 Fitting the model to the data - $LO_{m2d}$

The first term in Equation (5) aims at fitting the mesh parameterized by pose parameters $\theta$ to the preprocessed data $D$. To this end, the depth values are converted into a 3D point cloud based on the calibration data of the sensor. The point cloud is then smoothed by a bilateral filter (Paris and Durand 2009) and normals are computed (Holzer et al 2012). For each visible vertex of the model $v_i(\theta)$, with normal $n_i(\theta)$, we search for the closest point $X_i$ in the point cloud. This gives a 3D-3D correspondence for each vertex. We discard the correspondence if the angle between the normals of the vertex and the closest point is larger than 45° or the distance between the points is larger than 10 mm. We can then write the term $E_{model \to data}$ as a least squared error of point-to-point distances:

$$E_{model \to data}(\theta, D) = \sum_i \| v_i(\theta) - X_i \|^2 \tag{6}$$

An alternative to the point-to-point distance is the point-to-plane distance, which is commonly used for 3D reconstruction (Chen and Medioni 1991; Rusinkiewicz and Levoy 2001; Rusinkiewicz et al 2002). In this case:

$$E_{model \to data}(\theta, D) = \sum_i \| n_i(\theta)^T (v_i(\theta) - X_i) \|^2 . \tag{7}$$

The two distance metrics are evaluated in Section 4.1.1.
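The correspondence search with outlier rejection and the two residuals of Equations (6) and (7) can be sketched as follows. This is a simplified illustration under stated assumptions: the closest point is found by brute force rather than with a spatial data structure, visibility is assumed to be resolved by the caller, coordinates are in millimetres, and all function names are hypothetical.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _angle_deg(a, b):
    cos_angle = _dot(a, b) / (math.sqrt(_dot(a, a)) * math.sqrt(_dot(b, b)))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def build_correspondences(vertices, normals, cloud, cloud_normals,
                          max_dist_mm=10.0, max_angle_deg=45.0):
    """Pair each visible model vertex with its closest cloud point, then
    drop pairs that violate the distance or normal-angle threshold."""
    pairs = []
    for v, n in zip(vertices, normals):
        k = min(range(len(cloud)),
                key=lambda i: _dot(_sub(v, cloud[i]), _sub(v, cloud[i])))
        dist = math.sqrt(_dot(_sub(v, cloud[k]), _sub(v, cloud[k])))
        if dist > max_dist_mm or _angle_deg(n, cloud_normals[k]) > max_angle_deg:
            continue
        pairs.append((v, n, cloud[k]))
    return pairs


def point_to_point(pairs):
    """Equation (6): sum of squared vertex-to-point distances."""
    return sum(_dot(_sub(v, x), _sub(v, x)) for v, _, x in pairs)


def point_to_plane(pairs):
    """Equation (7): squared distances along the vertex normal."""
    return sum(_dot(n, _sub(v, x)) ** 2 for v, n, x in pairs)


# Toy data: only the first vertex has a close point with a compatible normal.
vertices = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (0.0, 50.0, 0.0)]
normals = [(0.0, 0.0, 1.0)] * 3
cloud = [(3.0, 0.0, 4.0), (100.0, 0.0, 30.0), (0.0, 50.0, 2.0)]
cloud_normals = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]

pairs = build_correspondences(vertices, normals, cloud, cloud_normals)
e_p2p = point_to_point(pairs)
e_p2plane = point_to_plane(pairs)
```

In the toy data the second vertex is rejected by the 10 mm distance test and the third by the 45° normal test. The point-to-plane residual ignores the tangential offset of the surviving pair, which is why it is smaller than the point-to-point residual.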


3.2.3 Fitting the data to the model - $LO_{d2m}$

Only fitting the model to the data is not sufficient, as we will show in our experiments. In particular, poses with self-occlusions can have a very low error since the measure only evaluates how well the visible part of the model fits the point cloud. The second term $E_{data \to model}(\theta, D)$ matches the data to the model to make sure that the solution is not degenerate and explains the data as well as possible. However, matching the data to the model is more expensive since the pose changes after each iteration, which would require updating the data structure for matching, e.g. distance fields or kd-trees, after each iteration. We therefore reduce the matching to depth discontinuities (Gall et al 2011a). To this end, we extract depth discontinuities from the depth map and the projected depth profile of the model using an edge detector (Canny 1986). Correspondences are again established by searching for the closest points, but now in the depth image using a 2D distance transform (Felzenszwalb and Huttenlocher 2004). Similar to $E_{model \to data}(\theta, D)$, we discard correspondences with a large distance. The depth values at the depth discontinuities in $D$, however, are less reliable not only due to the depth ambiguities between foreground and background, but also due to the noise of consumer sensors. The depth of the point in $D$ is therefore computed as the average in a local 3x3 pixels neighborhood and the outlier distance threshold is increased to 30 mm. The approximation is sufficient for discarding outliers, but insufficient for minimization. For each matched point in $D$ we therefore compute the projection ray uniquely expressed as a Plücker line (Pons-Moll and Rosenhahn 2011; Rosenhahn et al 2007; Stolfi 1991) with direction $d_i$ and moment $m_i$ and minimize the least square error between the projection ray and the vertex $v_i(\theta)$ for each correspondence:

$$E_{data \to model}(\theta, D) = \sum_i \| v_i(\theta) \times d_i - m_i \|^2 \tag{8}$$

We compared the matching based on depth discontinuities with a direct matching of the point cloud to the model using a kd-tree. The direct matching increases the runtime by 40% or more without reducing the error.

Fig. 5 “Walking” sequence. Without the collision term unrealistic mesh intersections are observed during interactions


3.2.4 Collision detection - C

Collision detection is based on the observation that two objects cannot share the same space and is of high importance in case of self-penetration, inter-finger penetration or general intensive interaction, as in the case depicted in Figure 5.

Collisions between meshes are detected by efficiently finding the set of colliding triangles $C$ using bounding volume hierarchies (BVH) (Teschner et al 2004). In order to penalize collisions and penetrations, we avoid using a signed 3D distance field for the whole mesh due to its high computational complexity and the fact that it has to be recomputed at every iteration of the optimization framework. Instead, we resort to a more efficient approach with local 3D distance fields defined by the set of colliding triangles $C$ that have the form of a cone as depicted in Figure 6. In case of multiple collisions the defined conic distance fields are summed up as shown in the same figure. Having found a collision between two triangles $f_t$ and $f_s$, the amount of penetration can be computed by the position inside the conic distance fields. The value of the distance field represents the intention of the repulsion that is needed to penalize the intrusion.

Let us consider the case where the vertices of $f_t$ are the intruders and the triangle $f_s$ is the receiver of the penetration. The opposite case is then similar. The cone for computing the 3D distance field $\Psi_{f_s}: \mathbb{R}^3 \to \mathbb{R}_+$ is defined by the circumcenter of the triangle $f_s$. Letting $n_{f_s} \in \mathbb{R}^3$ denote the normal of the triangle, $o_{f_s} \in \mathbb{R}^3$ the circumcenter, and $r_{f_s} \in \mathbb{R}_{>0}$ the radius of the circumcircle, we have

$$\Psi_{f_s}(v_t) = \begin{cases} \left| (1 - \Phi(v_t))\, \Upsilon\!\left( n_{f_s} \cdot (v_t - o_{f_s}) \right) \right|^2 & \Phi(v_t) < 1 \\ 0 & \Phi(v_t) \geq 1 \end{cases} \tag{9}$$

$$\Phi(v_t) = \frac{ \left\| (v_t - o_{f_s}) - \left( n_{f_s} \cdot (v_t - o_{f_s}) \right) n_{f_s} \right\| }{ -\frac{r_{f_s}}{\sigma} \left( n_{f_s} \cdot (v_t - o_{f_s}) \right) + r_{f_s} } \tag{10}$$

$$\Upsilon(x) = \begin{cases} -x + 1 - \sigma & x \leq -\sigma \\ -\frac{1 - 2\sigma}{4\sigma^2} x^2 - \frac{1}{2\sigma} x + \frac{1}{4}(3 - 2\sigma) & x \in (-\sigma, +\sigma) \\ 0 & x \geq +\sigma. \end{cases} \tag{11}$$

The term $\Phi$ projects the vertex $v$ onto the axis of the right circular cone defined by the triangle normal $n$ going through the circumcenter $o$ and measures the distance to it as illustrated in Figure 6. The distance is scaled by the radius of the cone at this point. If $\Phi(v) < 1$ the vertex is inside the cone and if $\Phi(v) = 0$ the vertex is on the axis. The term $\Upsilon$ measures how far the projected point is from the circumcenter and defines the intensity of the repulsion. If the argument of $\Upsilon$ is negative, the projected point is behind the triangle. Within the range $(-\sigma, +\sigma)$, the penalizer term is quadratic with values between zero and one. If the penetration is larger than $|\sigma|$ the penalizer term becomes linear. The parameter $\sigma$ also defines the field of view of the cone and is fixed to 0.5.

Fig. 6 (Left) Domain of the distance field generated by the face $f_s$. (Middle) Longitudinal section of the distance field: darker areas correspond to higher penalties. (Right) Distance fields add up in case of multiple collisions

For each vertex penetrating a triangle, a repulsion term in the form of a 3D-3D correspondence that pushes the vertex back is computed. The direction of the repulsion is given by the inverse normal direction of the vertex and its intensity by $\Psi$. Using point-to-point distances, the repulsion correspondences are computed for the set of colliding triangles $C$:

$$E_{collision}(\theta) = \sum_{(f_s(\theta), f_t(\theta)) \in C} \Big\{ \sum_{v_s \in f_s} \| -\Psi_{f_t}(v_s)\, n_s \|^2 + \sum_{v_t \in f_t} \| -\Psi_{f_s}(v_t)\, n_t \|^2 \Big\} \tag{12}$$

Though not explicitly denoted, $f_s$ and $f_t$ depend on $\theta$ and therefore also $\Psi$, $v$ and $n$. For point-to-plane distances, the equation gets simplified since $n^T n = 1$:

$$E_{collision}(\theta) = \sum_{(f_s(\theta), f_t(\theta)) \in C} \Big\{ \sum_{v_s \in f_s} \| -\Psi_{f_t}(v_s) \|^2 + \sum_{v_t \in f_t} \| -\Psi_{f_s}(v_t) \|^2 \Big\} \tag{13}$$

This term takes part in the objective function (5), regulated by weight $\gamma_c$. An evaluation of different $\gamma_c$ values is presented in Section 4.1.3
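The conic distance field of Equations (9) to (11) can be evaluated for a single intruding vertex with the following sketch. It is an illustration only: the function names are hypothetical, the triangle normal is assumed to have unit length, and the early return for points beyond the apex of the cone is a guard added here to avoid a non-positive denominator in Equation (10); there the penalizer $\Upsilon$ is zero anyway.

```python
import math

SIGMA = 0.5  # value of sigma reported in the text


def upsilon(x, sigma=SIGMA):
    """Penalizer of Equation (11): linear for deep penetration, zero in front."""
    if x <= -sigma:
        return -x + 1.0 - sigma
    if x < sigma:
        return (-(1.0 - 2.0 * sigma) / (4.0 * sigma ** 2) * x * x
                - x / (2.0 * sigma) + (3.0 - 2.0 * sigma) / 4.0)
    return 0.0


def psi(v, o, n, r, sigma=SIGMA):
    """Distance field of Equation (9) for vertex v and a receiving triangle
    with circumcenter o, unit normal n and circumradius r."""
    d = [v[i] - o[i] for i in range(3)]
    h = sum(n[i] * d[i] for i in range(3))      # signed height along the normal
    if h >= sigma:                              # beyond the apex: no penalty
        return 0.0
    radial = math.sqrt(sum((d[i] - h * n[i]) ** 2 for i in range(3)))
    phi = radial / (-(r / sigma) * h + r)       # Equation (10)
    if phi >= 1.0:                              # outside the cone
        return 0.0
    return ((1.0 - phi) * upsilon(h, sigma)) ** 2


# Triangle in the plane z = 0 with circumcenter at the origin and radius 1.
o, n, r = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), 1.0
shallow = psi((0.0, 0.0, -0.25), o, n, r)   # on the axis, slightly behind
deep = psi((0.0, 0.0, -1.0), o, n, r)       # on the axis, deep penetration
outside = psi((5.0, 0.0, -0.25), o, n, r)   # far from the axis
in_front = psi((0.0, 0.0, 1.0), o, n, r)    # in front of the triangle
```

The penalty grows with penetration depth along the axis and vanishes both outside the cone and in front of the triangle, matching the description of the field above.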


3.2.5 Salient point detection - S 

Our approach is so far based on a generative model, which provides accurate solutions in principle, but recovers only slowly from ambiguities and tracking errors. However, this can be compensated by integrating a discriminatively trained salient point detector into a generative model.

To this end, we train a fingertip detector on raw depth data. We manually annotate¹ the fingertips of 56 sequences consisting of approximately 2000 frames, with 32 of the sequences forming the training and 24 forming the testing set. We use a Hough forest (Gall et al 2011b) with 10 trees, each trained with 100000 positive and 100000 negative patches. The negative patches are uniformly sampled from the background. The trees have a maximal depth of 25 and a minimum leaf size of 20 samples. Each patch is sized 16 x 16 and consists of 11 feature channels: 2 channels obtained by a 5 x 5 min- and max-filtered depth channel and 9 gradient features obtained by 9 HOG bins using a 5 x 5 cell and soft binning. As for the pool of split functions at a test node, we randomly generate a set of 20000 binary tests. Testing is performed at multiple scales and non-maximum suppression is used to retain the most confident detections that do not overlap by more than 50%.

Since we resort to salient points only for additional robustness, it is usually sufficient to have only sparse fingertip detections. We therefore collect detections with a high confidence, choosing a threshold of $c_{thr} = 3.0$ for our experiments. The association between the $T$ fingertips $\phi_t$ of the model depicted in Figure 2d-f and the $S$ detections $\delta_s$ is solved by integer programming (Belongie et al 2002):

$$\begin{aligned} \underset{e_{st},\, \alpha_s,\, \beta_t}{\arg\min} \quad & \sum_{s,t} e_{st} w_{st} + \lambda \sum_s \alpha_s w_s + \lambda \sum_t \beta_t \\ \text{subject to} \quad & \sum_s e_{st} + \beta_t = 1 \quad \forall t \in \{1, \ldots, T\} \\ & \sum_t e_{st} + \alpha_s = 1 \quad \forall s \in \{1, \ldots, S\} \\ & e_{st}, \alpha_s, \beta_t \in \{0, 1\} \end{aligned} \tag{14}$$

¹ All annotated sequences are available at http://files.is.tue.mpg.de/dtzionas/hand-object-capture.html


As illustrated in Table 1, $e_{st} = 1$ defines an assignment of a detection $\delta_s$ to a fingertip $\phi_t$. The assignment cost is defined by $w_{st}$. If $\alpha_s = 1$, the detection $\delta_s$ is declared as a false positive with cost $\lambda w_s$ and if $\beta_t = 1$, the fingertip $\phi_t$ is not assigned to any detection with cost $\lambda$.

The weights $w_{st}$ are given by the 3D distance between the detection $\delta_s$ and the finger of the model $\phi_t$. For each finger $\phi_t$, a set of vertices are marked in the model. The distance is then computed between the 3D centroid of the visible vertices of $\phi_t$ (Figure 2d-f) and the centroid of the detected region $\delta_s$. The latter is computed based on the 3D point cloud $\delta'_s$ corresponding to the detection bounding box. For the weights $w_s$, we investigate two approaches. The first approach uses $w_s = 1$. The second approach takes the confidences $c_s$ of the detections into account when setting $w_s$. The weighting parameter $\lambda$ is evaluated in Section 4.1.2
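The association problem of Equation (14) can be illustrated on a tiny instance with the sketch below. Note the assumptions: instead of an integer programming solver, the sketch enumerates all admissible assignments exhaustively, which is only feasible for very few detections and fingertips; the function name `associate` and the toy costs are hypothetical.

```python
def associate(w, w_s, lam, num_fingertips):
    """Exhaustively minimize the cost of Equation (14).

    w[s][t] -- cost of assigning detection s to fingertip t
    w_s[s]  -- false-positive weight of detection s
    lam     -- cost lambda of an unassigned fingertip / false-positive factor
    Returns (best cost, {detection index: fingertip index}).
    """
    best = [float("inf"), {}]

    def search(s, used, cost, assignment):
        if s == len(w):
            # Every fingertip without a detection pays lambda (beta_t = 1).
            total = cost + lam * (num_fingertips - len(used))
            if total < best[0]:
                best[0], best[1] = total, dict(assignment)
            return
        # Option 1: declare detection s a false positive (alpha_s = 1).
        search(s + 1, used, cost + lam * w_s[s], assignment)
        # Option 2: assign detection s to a still unassigned fingertip t.
        for t in range(num_fingertips):
            if t not in used:
                assignment[s] = t
                search(s + 1, used | {t}, cost + w[s][t], assignment)
                del assignment[s]

    search(0, frozenset(), 0.0, {})
    return best[0], best[1]


# Three detections, two fingertips; distances in mm, lambda chosen arbitrarily.
w = [[2.0, 30.0],
     [25.0, 3.0],
     [40.0, 50.0]]
cost, assignment = associate(w, w_s=[1.0, 1.0, 1.0], lam=10.0, num_fingertips=2)
```

In this toy instance the first two detections are matched to the two fingertips and the third is declared a false positive, since paying $\lambda w_s$ is cheaper than forcing a distant match. The row and column constraints of Table 1 are enforced by construction: each detection takes exactly one option and each fingertip is used at most once.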
Table 1 The graph contains $T$ mesh fingertips $\phi_t$ and $S$ fingertip detections $\delta_s$. The cost of assigning a detection $\delta_s$ to a fingertip $\phi_t$ is given by $w_{st}$ as shown in table (a). The cost of declaring a detection as false positive is $\lambda w_s$, where $w_s$ is the detection confidence. The cost of not assigning any detection to fingertip $\phi_t$ is given by $\lambda$. The binary solution of table (b) is constrained to sum up to one for each row and column

(a) Costs
                      Fingertips φ_t
                      φ_1     φ_2     ...   φ_T     α
Detections δ_s  δ_1   w_11    w_12    ...   w_1T    λw_1
                δ_2   w_21    w_22    ...   w_2T    λw_2
                δ_3   w_31    w_32    ...   w_3T    λw_3
                ...   ...     ...     ...   ...     ...
                δ_S   w_S1    w_S2    ...   w_ST    λw_S
                β     λ       λ       ...   λ       ∞

(b) Binary solution
                      Fingertips φ_t
                      φ_1     φ_2     ...   φ_T     α
Detections δ_s  δ_1   e_11    e_12    ...   e_1T    α_1
                δ_2   e_21    e_22    ...   e_2T    α_2
                δ_3   e_31    e_32    ...   e_3T    α_3
                ...   ...     ...     ...   ...     ...
                δ_S   e_S1    e_S2    ...   e_ST    α_S
                β     β_1     β_2     ...   β_T     0



Fig. 7 Correspondences between the fingertips $\phi_t$ of the model and (a) the closest points of the associated detections $\delta'_s$ (b) the centroids of the associated detections $\delta'_s$


If a detection $\delta_s$ has been associated to a fingertip $\phi_t$, we have to define correspondences between the set of visible vertices of $\phi_t$ and the detection point cloud $\delta'_s$. If the fingertip $\phi_t$ is already very close to $\delta'_s$, i.e. $w_{st} < 10$ mm, we do not compute any correspondences since the localization accuracy of the detector is not higher. In this case just the close proximity of the fingertip to the data suffices for a good alignment. Otherwise, we compute the closest points between the vertices $v_i$ and the points $X_i$ of the detection $\delta'_s$ as illustrated in Figure 7a:

$$E_{salient}(\theta, D) = \sum_{s,t} e_{st} \Big( \sum_{(X_i, v_i) \in \delta'_s \times \phi_t} \| v_i(\theta) - X_i \|^2 \Big) \tag{15}$$

As in (7), a point-to-plane distance metric can replace the point-to-point metric. When less than 50% of the vertices of $\phi_t$ project inside the detection bounding box, we even avoid the additional step of computing correspondences between the vertices and the detection point cloud. Instead we associate all vertices with the centroid of the detection point cloud as shown in Figure 7b.
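The three cases described above (no correspondences, closest points, centroid) can be summarized in a short sketch. The function name, the argument layout and the boolean list `inside_bbox` (one flag per vertex, assumed to come from a separate projection test) are hypothetical and only serve the illustration; the thresholds are the ones stated in the text.

```python
def salient_correspondences(tip_vertices, detection_points, inside_bbox, w_st,
                            near_mm=10.0, min_inside=0.5):
    """Return (vertex, target point) pairs for one associated fingertip.

    tip_vertices     -- visible 3D vertices of the model fingertip
    detection_points -- 3D points of the detection point cloud
    inside_bbox      -- per-vertex flag: projects inside the detection box
    w_st             -- centroid distance between fingertip and detection (mm)
    """
    if w_st < near_mm:
        return []  # already within the localization accuracy of the detector

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    if sum(inside_bbox) / len(inside_bbox) < min_inside:
        # Little overlap: pull every vertex towards the detection centroid.
        n = len(detection_points)
        centroid = tuple(sum(p[k] for p in detection_points) / n for k in range(3))
        return [(v, centroid) for v in tip_vertices]

    # Sufficient overlap: closest detection point for each vertex.
    return [(v, min(detection_points, key=lambda x: sq_dist(v, x)))
            for v in tip_vertices]


tip = [(0.2, 0.0, 0.0), (1.9, 0.0, 0.0), (5.0, 0.0, 0.0)]
det = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]

close = salient_correspondences(tip, det, [True, True, True], w_st=5.0)
centroid_case = salient_correspondences(tip, det, [False, False, True], w_st=20.0)
closest_case = salient_correspondences(tip, det, [True, True, False], w_st=20.0)
```

The returned pairs would then enter Equation (15) as point-to-point residuals, or the point-to-plane variant if vertex normals are carried along.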



Fig. 8 Physical plausibility during hand-object interaction. (a) Input RGB-D image, (b-c) Obtained results without the physics component, (d) Obtained results with the physics component, ensuring a more realistic pose during interaction


3.2.6 Physics Simulation - P


A phenomenon that frequently occurs in the context of hand-object interaction is that of physically unrealistic poses due to occlusions or missing data. Such an example is illustrated in Figure 8, where a cube is grasped and moved by two fingers. Since one of the fingers that is in contact with the cube is occluded, the estimated pose is physically unrealistic. Due to gravity, the cube would fall down.

In order to compensate for this during hand-object interaction scenarios, we resort to physics simulation (Coumans 2013) for additional realism and physical plausibility. To this end, we model the static scene as well. Based on gravity and the parameters friction, restitution, and mass for each object, we can then run a physics simulation. To speed up the simulation, we represent each body or object part defined by the skinning weights, as shown in Figure 4, as a convex hull. The resulting convex hulls are visualized in an accompanying figure.

Given current pose estimates of the hands and the manipulated object, we first evaluate if the current solution is physically plausible. To this end, we run the simulation for 35 iterations with a time-step of 0.1 seconds. If the centroid of the object moves by less than 3 mm, we consider the solution as stable. Otherwise, we have to search for the hand pose which results in a more stable estimate. Since it is intractable to evaluate all possible hand poses, we search only for configurations which require a minor change of the hand pose. This is a reasonable assumption for our tracking scenario. To this end, we first compute the distances between all parts of the fingers, as depicted in Figure 10, and the object (Aggarwal et al 1987; Gärtner and Schönherr 2000). Each finger part with a distance of less than 10 mm is then considered as a candidate for being in contact with the object, and each combination of at least two and at most four candidate parts is taken into account.

The contribution of each combination to the stability of the object is examined through the physics simulation after rigidly moving the corresponding finger parts towards the closest surface point of the object. Figure 9 illustrates the case for a combination of two finger parts. The simulation is repeated for all combinations and we select the combination with the lowest movement of the object, i.e. the smallest displacement of its centroid from the initial position. Based on the selected combination, we define an energy that forces the corresponding finger parts to get in contact with the object by minimizing the closest distance between the parts $i$ and the object:

$$E_{physics}(\theta) = \sum_i \| V_i(\theta) - X_i \|^2 \qquad (16)$$

The vertices $V_i$ and $X_i$ correspond to the closest point of a finger part and the object, respectively. As in Section 3.2.2, a point-to-plane distance metric can replace the point-to-point metric.
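The search over supporting finger parts described in this section can be sketched as below. The sketch assumes a hypothetical callable `simulate_displacement_mm(parts)` that runs the physics simulation with the given finger parts moved into contact and returns the displacement of the object centroid in mm; the callable, the part names, and the function names are illustrative and not part of the original method description. The thresholds (10 mm, 3 mm, two to four parts) are the ones stated in the text.

```python
from itertools import combinations


def candidate_parts(part_distances_mm, max_dist_mm=10.0):
    """Finger parts closer to the object than the contact threshold."""
    return [part for part, dist in part_distances_mm.items() if dist < max_dist_mm]


def supporting_combinations(candidates, min_parts=2, max_parts=4):
    """All combinations of at least two and at most four candidate parts."""
    for size in range(min_parts, min(max_parts, len(candidates)) + 1):
        yield from combinations(candidates, size)


def select_support(part_distances_mm, simulate_displacement_mm, stable_mm=3.0):
    """Return the most stabilizing combination, or None if none is needed.

    simulate_displacement_mm(parts) is a hypothetical stand-in for the
    physics simulation; an empty tuple means "the current pose as it is".
    """
    if simulate_displacement_mm(()) < stable_mm:
        return None  # the current solution is already considered stable
    combos = list(supporting_combinations(candidate_parts(part_distances_mm)))
    if not combos:
        return None  # no finger part is close enough to act as support
    return min(combos, key=simulate_displacement_mm)
```

With n candidate parts the number of simulated combinations grows as C(n,2) + C(n,3) + C(n,4), which is one reason why restricting the search to parts within 10 mm of the object keeps the step affordable.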


3.2.7 Anatomical limits 



Fig. 9 Low resolution representation of the hands and ob¬ 
jects for the physics simulation. In order to predict the finger 
parts (green) that give the physically most stable results if 
they were in contact with the object (white), all combina¬ 
tions of finger parts close to the object are evaluated. The 
image shows how two finger parts are moved to the object for 
examining the contribution to the stability of the object. The 
stability is measured by a physics simulation where all green 
parts are static 



Fig. 10 Finger parts that form all possible supporting com¬ 
binations in the physics simulation component. Parts with 
red color do not take part in this process 





Fig. 11 Angle limits are independently defined for each 
revolute joint. The plot visualizes the function (a), and its 
truncated derivative (b), that penalizes the deviation from 
an allowed range of ±20.0 degree 


Anatomically inspired joint-angle limits (Albrecht et al 2003) are enforced as soft constraints by the term:

$$E_{anatomy}(\theta) = \sum_k \big( \exp(p\,(l_k - \theta_k)) + \exp(p\,(\theta_k - u_k)) \big) \qquad (17)$$

where $p = 10$. The index $k$ goes over all revolute joints and $[l_k, u_k]$ is the allowed range for each of them. The term is illustrated for a single revolute joint in Figure 11. We use $\gamma_a = 0.0015\, C_{all}$, where $C_{all}$ is the total number of correspondences.
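As a small numerical illustration of the soft joint-limit term, the following sketch evaluates the sum of the two exponentials per joint and its derivative. It assumes angles in radians and does not model the truncation of the derivative shown in Figure 11; the function names are chosen for the example only.

```python
import numpy as np


def anatomy_energy(theta, lower, upper, p=10.0):
    """Sum over joints of exp(p (l_k - theta_k)) + exp(p (theta_k - u_k))."""
    theta, lower, upper = (np.asarray(a, dtype=float) for a in (theta, lower, upper))
    return float(np.sum(np.exp(p * (lower - theta)) + np.exp(p * (theta - upper))))


def anatomy_gradient(theta, lower, upper, p=10.0):
    """Derivative of the energy with respect to each joint angle (untruncated)."""
    theta, lower, upper = (np.asarray(a, dtype=float) for a in (theta, lower, upper))
    return p * (np.exp(p * (theta - upper)) - np.exp(p * (lower - theta)))
```

Inside the allowed range both exponentials are small, so the term contributes little; once an angle leaves the range, one exponential grows quickly and pushes the angle back.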


3.2.8 Regularization 


In case of occlusions or due to missing depth data, the objective function based solely on the previous terms can be ill-posed. We therefore add a term that penalizes deviations from the previous estimated joint angles $\hat{\theta}$:

$$E_{regularization}(\theta) = \sum_k ( \theta_k - \hat{\theta}_k )^2 \qquad (18)$$

We use $\gamma_r = 0.02\, C_{all}$.
Algorithm 1: Pose estimation for RGB-D data with point-to-point distances

$\hat{\theta}$ = pose estimate of the previous frame
$i = 0$, $\theta_0 = \hat{\theta}$
Repeat until convergence or max $i_{thr}$ iterations
    - Render meshes at pose $\theta_i$
    - Find corresp. $LO_{m2d}$ (Section 3.2.2)
    - Find corresp. $LO_{d2m}$ (Section 3.2.3)
    - Find corresp. C (Section 3.2.4, Eq. (12))
    - Find corresp. S (Section 3.2.5, Eq. (15))
    - Find corresp. V (Section 3.2.6, Eq. (16))
    - $\theta_{i+1} = \arg\min_\theta E(\theta, D)$
    - $i = i + 1$


3.2.9 Optimization 


For pose estimation, we alternate between computing the correspondences $LO_{m2d}$ (Section 3.2.2), $LO_{d2m}$ (Section 3.2.3), C (Section 3.2.4), S (Section 3.2.5), and V (Section 3.2.6) according to the current pose estimate and optimizing the objective function based on them, as summarized in Algorithm 1. This process is repeated until convergence or until a maximum number of iterations $i_{thr}$ is reached. It should be noted that the objective function $E(\theta, D)$ is only differentiable for a given set of correspondences. We optimize $E(\theta, D)$, which is a non-linear least squares problem, with the Gauss-Newton method as in (Brox et al 2010).
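The alternation between correspondence search and Gauss-Newton optimization can be illustrated on a deliberately small toy problem: aligning a 2D point set to data with a rigid transform. This is only a sketch of the general scheme under simplifying assumptions (three parameters instead of an articulated hand model, a single point-to-point term, brute-force closest points); all function names are hypothetical.

```python
import numpy as np


def transform(points, params):
    """Apply a 2D rigid transform; params = (angle, tx, ty)."""
    angle, tx, ty = params
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    return points @ rot.T + np.array([tx, ty])


def closest_points(source, target):
    """For every source point, return the nearest target point."""
    d2 = ((source[:, None, :] - target[None, :, :]) ** 2).sum(axis=2)
    return target[d2.argmin(axis=1)]


def gauss_newton_step(model, matched, params):
    """One Gauss-Newton update for sum_i ||T(model_i; params) - matched_i||^2."""
    c, s = np.cos(params[0]), np.sin(params[0])
    residual = (transform(model, params) - matched).ravel()
    # Rows alternate x/y residuals; columns are (angle, tx, ty).
    jac = np.zeros((2 * len(model), 3))
    jac[0::2, 0] = -s * model[:, 0] - c * model[:, 1]
    jac[1::2, 0] = c * model[:, 0] - s * model[:, 1]
    jac[0::2, 1] = 1.0
    jac[1::2, 2] = 1.0
    delta, *_ = np.linalg.lstsq(jac, -residual, rcond=None)
    return params + delta


def fit(model, data, params, max_iters=10, tol=1e-9):
    """Alternate correspondence search and optimization until convergence."""
    params = np.asarray(params, dtype=float)
    for _ in range(max_iters):
        matched = closest_points(transform(model, params), data)  # fixed for this step
        new_params = gauss_newton_step(model, matched, params)
        converged = np.linalg.norm(new_params - params) < tol
        params = new_params
        if converged:
            break
    return params
```

Because the correspondences are frozen within each step, the energy is differentiable there, which is the property the text relies on; the sketch converges only when the initial pose is close enough for the closest-point matches to be mostly correct, as in a frame-to-frame tracking setting.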


3.3 Multicamera RGB

The previously described approach can also be applied to multiple synchronized RGB videos. To this end, the objective function needs to be changed only slightly due to the differences between depth and RGB data. While the error is directly minimized in 3D for RGB-D data, we minimize the error for RGB images in 2D since all our observations are 2D. Instead of using a 3D point-to-point or point-to-plane measure, the error is therefore given by

$$\sum_c \sum_i \| \Pi_c(V_i(\theta)) - x_{i,c} \|^2 \qquad (19)$$

where $\Pi_c : \mathbb{R}^3 \to \mathbb{R}^2$ are the known projection functions, mapping 3D points into the image plane of each static camera $c$, and $(V_i, x_{i,c})$ is a correspondence between a 3D vertex and a 2D point. Furthermore, the salient point detector, introduced in Section 3.2.5, is not applied to the depth data but to all camera views. Since multiple high resolution views allow detecting more distinctive image features, we do not detect fingertips but fingernails in this case.

The only major change is required for the data terms $E_{model \to data}(\theta, D)$ and $E_{data \to model}(\theta, D)$. The term $E_{data \to model}(\theta, D)$ is replaced by an edge term that matches edge pixels in all camera views to the edges of the projected model in the current pose $\theta$. As in the RGB-D case, the orientation of the edges is taken into account for matching and mismatches are removed by thresholding. Working with 2D distances, though, has the disadvantage of not being able to apply intuitive 3D distance thresholds, as presented in Section 3.2.3. In order to have an alternative rejection mechanism for noisy correspondences, we compute for each bone the standard deviation of the 2D error that is suggested by all of its correspondences. Subsequently, correspondences that suggest an error bigger than twice this standard deviation are rejected as outliers. The second term $E_{model \to data}(\theta, D)$ is replaced by a term based on optical flow as in (Ballan and Cortelazzo 2008). The term introduces temporal consistency and harnesses the higher resolution and frame rates of the RGB data in comparison to the RGB-D data.


4 Experimental Evaluation


Benchmarking in the context of 3D hand tracking remains an open problem (Erol et al 2007) despite recent contributions (Qian et al 2014; Sridhar et al 2013; Tang et al 2013, 2014; Tompson et al 2014; Tzionas and Gall 2013; Tzionas et al 2014). The vast majority of them focuses on the problem of single hand tracking, especially in the context of real-time human computer interaction, neglecting challenges occurring during the interaction between two hands or between hands and objects. For this reason we captured 29 sequences in the context of hand-hand and hand-object interaction. The sequences were captured either with a single RGB-D camera or with 8 synchronized RGB cameras. While 20 sequences have been used in the preliminary works (Ballan et al 2012; Tzionas et al 2014), which include interactions with rigid objects, the 9 newly captured sequences also include interactions with non-rigid objects.

We first evaluate our approach on RGB-D sequences with hand-hand interactions in Section 4.1. Sequences with hand-object interactions are used in Section 4.2 for evaluation and finally our approach is evaluated on sequences captured with several RGB cameras in Section 4.4.
Fig. 12 Hand joints used for quantitative evaluation. Only
the green joints of our hand model are used for measuring the 
pose estimation error 


4.1 Monocular RGB-D - Hand-Hand Interactions 


Related RGB-D methods (Oikonomidis et al 2011a) 
usually report quantitative results only on synthetic se¬ 
quences, which inherently include ground-truth, while 
for realistic conditions they resort to qualitative results. 

Although qualitative results are informative, quantitative evaluation based on ground-truth is of high importance. We therefore manually annotated 14 sequences, 11 of which are used to evaluate the components of our pipeline and 3 for comparison with the state-of-the-art method (Oikonomidis et al 2011a). These sequences contain motions of a single hand and two interacting hands with 37 and 74 DoF, respectively. They vary from 100 to 270 frames and contain several actions, like “Walking”, “Crossing”, “Crossing and Twisting”, “Tips Touching”, “Dancing”, “Tips Blending”, “Hugging”, “Grasping”, “Flying”, as well as performing the “Rock” and “Bunny” gestures. As an indicator for the accuracy of the annotations, we measured the standard deviation of 4 annotators, which is 1.46 pixels. All sequences were captured in 640×480 resolution at 30 fps with a Primesense Carmine 1.09 camera.


The error metric for our experiments is the 2D distance (pixels) between the projection of the 3D joints and the corresponding 2D annotations. The joints taken into account in the metric are depicted in Figure 12. Unless explicitly stated, we report the average over all frames of all relevant sequences.
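The error metric can be sketched in a few lines. The example assumes an ideal pinhole camera with hypothetical intrinsics (`fx`, `fy`, `cx`, `cy`) and joints given in camera coordinates; the actual calibration of the sensor is not specified here.

```python
import numpy as np


def project(points_3d, fx, fy, cx, cy):
    """Pinhole projection of (N, 3) camera-space points to (N, 2) pixels."""
    x, y, z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
    return np.stack([fx * x / z + cx, fy * y / z + cy], axis=1)


def mean_joint_error_px(joints_3d, annotations_2d, fx, fy, cx, cy):
    """Average 2D distance (pixels) between projected joints and annotations."""
    projected = project(joints_3d, fx, fy, cx, cy)
    return float(np.linalg.norm(projected - annotations_2d, axis=1).mean())
```

Averaging this value over all annotated joints and frames of a sequence gives a per-sequence error of the kind reported in the tables of this section.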


Our system is based on an objective function consisting of five terms, described in Section 3.2. Two of them minimize the error between the posed mesh and the depth data by fitting the model to the data and the data to the model. A salient point detector further constrains the pose using fingertip detections in the depth image, while a collision detection method contributes to realistic pose estimates that are physically plausible. The function is complemented by the physics simulation component, which contributes towards more realistic interaction of hands with objects. However, this component is only relevant for hand-object interactions and thus it will be studied in detail in Section 4.2. In the following, we evaluate each component and the parameters of the objective function.


Table 2 Evaluation of point-to-point (p2p) and point-to-plane (p2plane) distance metrics, along with the number of iterations of the optimization framework, using a 2D distance error metric (px). The highlighted setting is used for all other experiments

Iterations    5      10     15     20     30
p2p           7.33   5.25   5.05   4.98   4.91
p2plane       5.33   5.12   5.08   5.07   5.05


Table 3 Evaluation of the weighting parameter $\lambda$ in (14), using a 2D distance error metric (px). Weight $\lambda = 0$ corresponds to the objective function without salient points, noted as “LO + C” in Table 6. Both versions of $w_s$ described in Section 3.2.5 are evaluated. The highlighted setting is used for all other experiments

λ                    0      0.3    0.6    0.9    1.2    1.5    1.8
w_s = 1              5.17   5.17   5.15   5.14   5.12   5.12   5.23
w_s = c_s / c_thr    5.17   5.14   5.12   5.12   5.12   5.22   5.61


4.1.1 Distance Metrics

Table 2 presents an evaluation of the two distance metrics presented in Section 3.2.2, namely point-to-point and point-to-plane, along with the number of iterations of the minimization framework. The point-to-plane metric leads to an adequate pose estimation error with only 10 iterations, providing a significant speed gain compared to point-to-point. If the number of iterations does not matter, the point-to-point metric is preferable since it results in a lower error and does not suffer from wrongly estimated normals.

For the first frame, we perform 50 iterations in order 
to ensure an accurate refinement of the manually initial¬ 
ized pose. For the chosen setup, we measure the runtime 
for the sequence “Bunny” that contains one hand and 
for the sequence “Crossing and Twisting” that contains 
two hands. For the first sequence, the runtime is 2.82 
seconds per frame, of which 0.12 seconds are attributed 
to the salient point component S and 0.65 to the colli¬ 
sion component C. For the second sequence, the runtime 
is 4.96 seconds per frame, of which 0.05 seconds are at¬ 
tributed to the component S and 0.36 to the component 
C. 

4.1.2 Salient Point Detection - S

The salient point detection component depends on the parameters $w_s$ and $\lambda$, as described in Section 3.2.5. Table 3 summarizes our evaluation of the parameter $\lambda$, spanning a range of possible values for both cases $w_s = 1$ and $w_s = c_s / c_{thr}$. The difference between the two versions of $w_s$ is minor, although the optimal range of $\lambda$ varies for the two versions. The latter is expected since $c_s / c_{thr} > 1$ and smaller values of $\lambda$ compensate for the mean difference to $w_s = 1$ in (14). If $\lambda = 0$, all detections are classified as false positives and the salient points are not used in the objective function.


Fig. 13 Precision-recall plot for (a) our RGB-D dataset and (b) the Dexter dataset. We show the performance of a fingertip detector trained only on depth (blue) and only on RGB (red) images. The area under the curve (AUC) for our dataset (a) is 0.19 and 0.55 respectively. The AUC for the Dexter dataset (b) is 0.95


Table 4 Evaluation of collision weights $\gamma_c$, using a 2D distance error metric (px). Weight 0 corresponds to the objective function without collision term, noted as “LO + S” in Table 6. Sequences are grouped in 3 categories: “severe” for intense, “some” for light and “no apparent” for imperceptible collision. “≥ some” is the union of “severe” and “some”. The highlighted value is the default value we use for all other experiments

γ_c       0      1      2      3      5      7.5    10     12.5
All       5.34   5.44   5.57   5.16   5.12   5.12   5.12   5.14
Severe    5.90   6.07   6.27   5.62   5.56   5.57   5.55   5.61
≥ Some    5.44   5.57   5.72   5.23   5.18   5.19   5.18   5.22
Some      3.99   3.98   3.98   3.98   3.98   3.99   3.99   3.98


To evaluate the performance of the detector, we follow the PASCAL-VOC protocol (Everingham et al 2010). Figure 13a shows the precision-recall plot for our RGB-D dataset including all hand-hand and hand-object sequences. The plot shows that the detector does not perform well on this dataset and suffers from the noisy raw depth data. This also explains why the salient term improves the pose estimation only slightly. We therefore trained and evaluated the detector also on the RGB data. In this case, the detection accuracy is much higher. We also evaluated the detector on the Dexter dataset (Sridhar et al 2013). On this dataset, the detector is very accurate. Our experiments on Dexter in Section 4.1.6 and a multi-camera RGB dataset in Section 4.4 will show that the salient points reduce the error more if the detector performs better.


Table 5 Comparison of the proposed collision term based on 3D distance fields with correspondences between vertices of colliding triangles

          Corresponding vertices    Distance fields
All       6.66                      5.12
Severe    7.96                      5.55
≥ Some    7.04                      5.18
Some      4.12                      3.99


4.1.3 Collision Detection - C

The impact of the collision detection component is regulated in the objective function by the weight $\gamma_c$. For the evaluation, we split the sequences in three sets depending on the amount of observed collision: severe, some, and no apparent collision. The set with severe collisions comprises “Walking”, “Crossing”, “Crossing and Twisting”, “Dancing”, “Hugging”; some collisions are present in “Tips Touching”, “Rock”, “Bunny”; and no collisions are apparent in “Grasping”, “Tips Blending”, “Flying”. Table 4 summarizes our evaluation experiments for the values of $\gamma_c$. The results show that over all sequences, the collision term reduces the error and that a weight $\gamma_c \geq 3$ gives similar results. For small weights $0 < \gamma_c < 3$, the error is even slightly increased compared to $\gamma_c = 0$. In this case, the impact is too small to avoid collisions and the term only adds noise to the pose estimation. As expected, the impact of the collision term is only observed for the sequences with severe collision.

The proposed collision term is based on a fast approximation of the distance field of an object. It is continuous and less sensitive to a change of the mesh resolution than a repulsion term based on 3D-3D correspondences between vertices of colliding triangles. To show this, we replaced the collision term by correspondences that move vertices of colliding triangles towards the counterpart. The results in Table 5 show that such a simple repulsion term performs poorly.


4.1.4 Component Evaluation

Table 6 presents the evaluation of each component and the combination thereof. Simplified versions of the pipeline, fitting either just the model to the data ($LO_{m2d}$) or the data to the model ($LO_{d2m}$), can lead to a collapse of the pose estimation due to unconstrained optimization. Our experiments quantitatively show the notable contribution of both the collision detection and the salient point detector. The best overall system performance is achieved with all four components of the objective function. The fifth term $E_{physics}$ is only relevant for hand-object interactions and will be evaluated in Section 4.2. Table 7 shows the error for each sequence. A figure at the end of the article depicts qualitative results for 8 out of the 11 sequences. It shows that the hand motion is accurately captured even in cases of close interaction and severe occlusions. The data and videos are available.¹


Table 6 Evaluation of the components of our pipeline. “LO” stands for local optimization and includes fitting both data-to-model (d2m) and model-to-data (m2d), unless otherwise specified. Collision detection is noted as “C”, while salient point detector is noted as “S”. The number of sequences where the optimization framework collapses is noted in the last row, while the mean error is reported only for the rest

Components          LO_m2d   LO_d2m   LO     LO + C   LO + S   LO + CS
Mean Error (px)     27.17    -        5.53   5.17     5.34     5.12
Improvement (%)                       -      6.46     3.44     7.44
Failed Sequences    1/11     11/11    0/11   0/11     0/11     0/11


Table 7 Pose estimation error for each sequence

Sequence                 Mean Error (px)   Standard Deviation (px)   Max Error (px)
Walking                  5.99              3.65                      24.19
Crossing                 4.53              2.99                      18.03
Crossing and Twisting    4.76              3.51                      22.80
Tips Touching            3.65              2.21                      13.60
Dancing                  6.49              3.70                      20.25
Tips Blending            4.87              2.97                      18.36
Hugging                  5.22              3.42                      20.03
Grasping                 4.37              2.06                      11.05
Flying                   5.11              2.77                      15.03
Rock                     4.44              2.63                      14.76
Bunny                    4.50              2.61                      10.63


Fig. 14 Qualitative comparison with (Oikonomidis et al 2011a). Each image pair corresponds to the pose estimate of the FORTH tracker (up) and our tracker (down)


4.1.5 Comparison to State-of-the-Art

Recently, Oikonomidis et al (2011a,b, 2012) used particle swarm optimization (PSO) for a real-time hand tracker. For comparison we use the software released for tracking a single hand (Oikonomidis et al 2011a), with the parameter setups used also in the other works. Each setup is evaluated three times in order to compensate for the manual initialization and the inherent randomness of PSO. Qualitative results depict the best result of all three runs, while quantitative results report the average error. Table 8 shows that our system outperforms (Oikonomidis et al 2011a) in terms of tracking accuracy. Figure 14 shows a visual comparison. However, it should be noted that the GPU implementation of (Oikonomidis et al 2011a) runs in real time using 25 generations and 64 particles, in contrast to our single-threaded CPU implementation.

¹ All annotated sequences are available at http://files.is.tue.mpg.de/dtzionas/hand-object-capture.html

4.1.6 Dexter dataset

We further evaluate our approach on the recently introduced Dexter dataset (Sridhar et al 2013). As suggested in (Sridhar et al 2013), we use the first part of the sequences for evaluation and the second part for training. More specifically, the evaluation set contains the frames 018-158 of the sequence “adbadd”, 061-185 of “fingercount”, 020-173 of “fingerwave”, 025-224 of “flexex1”, 024-148 of “pinch”, 024-123 of “random”, and 016-166 of “tigergrasp”. We use only the depth of the Time-of-Flight camera.

The performance of our tracker is summarized in Tables 9 and 10. Since the dataset does not provide a hand model, we simply scaled our hand model in (x, y, z) direction by (0.95, 0.95, 1). Since the annotations in the dataset do not correspond to anatomical landmarks but are close to the fingertips, we compare the annotations with the endpoints of our skeleton. Table 9 shows the error of our tracker for each of the sequences, reporting the mean, the maximum, and the standard deviation of the error over all the tested frames. Despite the differences between our hand model and the data, the average error is for most sequences only around 1 cm. Our approach, however, fails for the sequence “random” due to the very fast motion in the sequence.


Table 8 Comparison with (Oikonomidis et al 2011a). We evaluate the FORTH tracker with 4 parameter settings, 3 of which were used in the referenced literature of the last column

                 Mean (px)   St.Dev (px)   Max (px)   Generations   Particles   Reference
FORTH, set 1     8.58        5.74          61.81      25            64          Oikonomidis et al (2011a)
FORTH, set 2     8.32        5.42          57.97      40            64          Oikonomidis et al (2011b)
FORTH, set 3     8.09        5.00          38.90      40            128
FORTH, set 4     8.16        5.18          39.85      45            64          Oikonomidis et al (2012)
Proposed         3.76        2.22          19.92


Table 9 Pose estimation error of our tracker for each sequence of the Dexter dataset.

LO + SC              Mean Error   St. Deviation   Max Error
[mm]  Adbadd         17.34        15.35           69.73
      Fingercount    11.94        7.18            47.77
      Fingerwave     10.88        5.47            49.62
      Flexex1        11.87        12.86           91.70
      Pinch          24.19        28.34           131.97
      Random         96.93        122.34          559.37
      Tigergrasp     11.77        5.36            30.18
[px]  Adbadd         7.79         8.38            42.54
      Fingercount    6.03         5.39            38.28
      Fingerwave     4.45         2.80            15.26
      Flexex1        5.24         8.37            61.40
      Pinch          12.56        16.48           73.16
      Random         59.93        77.77           307.00
      Tigergrasp     6.84         4.22            21.21

Table 10 presents the evaluation of each component of our pipeline and the combination of them. On this dataset, both the collision term and the salient point detector reduce the error. Compared to Table 6, the error is reduced more. In particular, the salient point detector reduces the error more since the detector performs well on this dataset, as shown in Figure 13b. Compared to “LO”, the average error of “LO + SC” is more than 3.5 mm lower. The average error reported by Sridhar et al (2013) on the slow part of the Dexter dataset is 13.1 mm.


Table 10 Evaluation of the components of the objective function on the Dexter dataset. “LO” stands for local optimization and includes fitting both data-to-model and model-to-data. Collision detection is noted as “C” and salient point detector as “S”. The “random” sequence is excluded because our approach fails due to very fast motion

Components       Mean Error   St. Dev.
[mm]  LO + SC    14.26        14.91
      LO + S     15.51        16.67
      LO + C     16.97        16.60
      LO         17.86        18.80
[px]  LO + SC    6.90         8.88
      LO + S     7.64         9.87
      LO + C     8.98         10.29
      LO         9.33         10.73


Fig. 15 Failure case due to missing data and detection errors. The images show RGB image (top-left), input depth image (top-right), fingertip detections (bottom-left), and estimated pose (bottom-right). The detector operates on the raw depth image, while the RGB image is used just for visualization


4.2 Monocular RGB-D - Hand-Object Interactions

For the evaluation of the complete energy function for hand-object interactions, we captured 7 new sequences of hands interacting with several objects, either rigid (ball, cube) or articulated (pipe, rope). The DoF of the objects varies a lot. The rigid objects have 6 DoF, the pipe 7 DoF, and the rope 76 DoF. The sequences vary from 180 to 400 frames and contain several actions, like “Moving a Ball” with one (43 DoF) or two hands (80 DoF), “Moving a Cube” with one hand (43 DoF), “Bending a Pipe” with two hands (81 DoF), and “Bending a Rope” with two hands (150 DoF). In addition, the sequences “Moving a Ball” with one hand and “Moving a Cube” were captured twice, one with occlusion of a manipulating finger and one without. Manual ground-truth annotation was performed by a single subject.


Table 11 Evaluation of the friction value of both the hands and the object. We report the error over all the frames of all seven sequences with hand-object interactions using a 2D error metric (px). Value 3.0 is the same as the friction value of the static scene. The highlighted value is the default value we use for all other experiments

Friction         0.6    0.9    1.2    1.5    3.0
Mean (px)        6.19   6.18   6.19   6.17   6.17
St. Dev. (px)    3.82   3.81   3.81   3.81   3.81

For the salient point (S) and the collision detection component (C), we use the parameter setup presented in Section 4.1. The influence of the physics simulation component (V) and its parameters is evaluated in the following section. The error metric used is the 2D distance (pixel units) between the projection of the 3D joints and the 2D annotations, as in Section 4.1 and visualized in Figure 12. Unless otherwise stated, we report the average over all frames of all seven sequences.


4.2.1 Physics Simulation - V 


For the physics simulation, we model the entire scene, which includes the hands as well as manipulated and static objects, with a low resolution representation as described in Section 3.2.6 and visualized in Figure 9. Each component of the scene is characterized by three properties: friction, restitution, and mass. Since in each simulation step we consider each component except for the manipulated object as static, only the mass of the object is relevant, which we set to 1 kg. We set the restitution of the static scene and hands to 0 and of the object to 0.5. For the static scene, we use a friction value of 3. The friction values of the hand and the object are assumed to be equal. Since the main purpose of the physics simulation is to evaluate if the current pose estimates are physically stable, the exact values for friction, restitution, and mass are not crucial. To demonstrate this, we evaluate the impact of the friction value for hands and manipulated objects. For this experiment, we set the weight $\gamma_{ph}$ equal to 10.0, being the same as the weight $\gamma_c$ of the complementary collision detection component. The results presented in Table 11 show that the actual value of friction has no significant impact on the pose estimation error as long as it is in a reasonable range.

The impact of the physics simulation component $E_{physics}$ in the objective function is regulated by the weight $\gamma_{ph}$. The term penalizes implausible manipulation or grasping poses. For the evaluation, we split the sequences in three sets depending on the amount of occlusions of the manipulating fingers: “severe” for intense (“Moving a Cube” with one hand and occlusion), “some” for light (“Moving a Ball” with one hand and occlusion, “Moving a Cube” with one hand) and “no apparent” for imperceptible occlusions (“Moving a Ball” with one and two hands, “Bending a Pipe”, “Bending a Rope”). Table 12 summarizes the pose estimation error for various values of $\gamma_{ph}$ for the three subsets. Although the pose estimation error is only slightly reduced by $E_{physics}$, the results are physically more plausible. This is shown in a figure at the end of the article, which provides a qualitative comparison between the setups “LO + SC” and “LO + SCV”. The images show the notable contribution of component V towards more realistic, physically plausible poses, especially in cases of missing or ambiguous visual data, as in sequences with an occluded manipulating finger. To quantify this, we ran the simulation for 35 iterations with a time-step of 0.1 seconds after the pose estimation and measured the displacement of the centroid of the object for each frame. While the average displacement is 9.26 mm for the setup “LO + SC”, the displacement is reduced to 9.05 mm by the setup “LO + SCV”. The tracking runtime for the aforementioned sequences for the setup “LO + SC” ranges from 4 to 8 seconds per frame. The addition of V in the setup “LO + SCV” increases the runtime for most sequences by about 1 second. However, this increase might reach up to more than 1 minute depending on the complexity of the object and tightness of interaction, as in the case of “Bending a Pipe” with two hands (150 DoF), with the main bottleneck being the computation of the closest finger vertices to the manipulated object. Figure 20 depicts qualitative results for the full setup “LO + SCV” of the objective function for all seven sequences. The results show successful tracking of interacting hands with both rigid and articulated objects, whose articulation is described from 1 to as many as 71 DoF.


Table 12 Evaluation of the physics weight $\gamma_{ph}$ for “LO + SCV”, using a 2D distance error metric (px). Weight 0 corresponds to the objective function without physics term, noted as “LO + SC” in Table 13. Sequences are grouped in 3 categories: “severe” for intense, “some” for light and “no apparent” for imperceptible occlusion of manipulating fingers. “≥ some” is the union of “severe” and “some”. The highlighted value is the default value we use for all other experiments

γ_ph      0      1      2      3      5      7.5    10     12.5
All       6.21   6.20   6.21   6.19   6.19   6.18   6.19   6.17
Severe    5.68   5.66   5.65   5.63   5.63   5.63   5.62   5.61
≥ Some    6.02   6.00   6.00   5.98   5.98   5.97   5.96   5.94



































Percentage (%) of Converged Frames Percentage (%) of Converged Frames 
Table 13 Evaluation of the components of the objective
function. “LO” stands for local optimization and includes
fitting both data-to-model and model-to-data. Collision
detection is denoted as “C”, salient point detector as “S”
and physics simulation as “P”. We report the error for a
fixed number of 10 iterations and for the stopping criterion
e < 0.2 mm

              Fixed 10 Iterations     Stopping Thresh. 0.2 mm
Components    Mean      St. Dev.      Mean      St. Dev.
LO + SCP      6.19      3.81          6.25      3.86
LO + SC       6.21      3.82          6.31      3.89
LO + S        6.05      3.76          6.09      3.77
LO + CP       6.19      3.83          6.31      3.90
LO + C        6.24      3.84          6.38      3.94
LO            6.07      3.77          6.15      3.83




Fig. 16 Number of iterations that are required to converge
for LO + SCP (top) and LO (bottom). (a,c) Distribution
of frames where the pose estimation converged after a given
number of iterations. (b,d) Cumulative distribution


4.2.2 Component Evaluation


Table 13 presents the evaluation of each component and of
their combinations for the seven sequences with hand-object
interaction. Since the physics simulation P assumes that
there are no severe intersections, it is meaningful only as
a complement to the collision component C. One can observe
that the differences between the components are relatively
small, since the hand poses in the hand-object sequences are
in general simpler than the poses in the sequences with
tight hand-hand interactions considered in Section 4.1. The
collision term C slightly increases the error, but without
this term the hand poses are often physically implausible
and intersect with the object. When comparing LO + SC and
LO + SCP, we see that the error is slightly reduced by the
physics simulation component P. The pose estimation errors
for each sequence using LO + SCP are summarized in Table 14.

Instead of using a fixed number of iterations per frame, a
stopping criterion can be used. We use the average change
of the joint positions after each iteration, with a
threshold of 0.2 mm and a maximum of 50 iterations. Table 13
shows that with the stopping criterion the impact of the
terms is slightly more prominent, but also that the error is
slightly higher for all approaches. To analyze this in more
detail, we report the distribution of the iterations
required until the stopping criterion is reached in
Figure 16. Although LO + SCP requires a few more iterations
until convergence than LO, it converges in 10 or fewer
iterations in 92% of the frames, which supports our previous
results. There are, however, very few frames where the
approach has not converged after 50 iterations. In most of
these cases, the local optimum of the energy is far away
from the true pose and the error increases with more
iterations. These outliers are also the reason for the
slight increase of the error in Table 13. We observed this
behavior for all combinations from LO to LO + SCP, which
shows that the energy can be further improved.


4.3 Limitations


As shown in Sections 4.1 and 4.2, our approach accurately
captures the motion of hands tightly interacting either with
each other or with a rigid or articulated object. However,
for very fast motion like the “random” sequence of the
Dexter dataset our approach fails. Furthermore, we assume
that a hand model is given or can be acquired by an approach
like (Taylor et al 2014). Figure also visualizes an
inaccurate hand pose of the lower hand due to missing depth
data and two detections, which are not at the finger tips
but located at other bones.


4.4 Multicamera RGB 

We finally evaluated the approach on sequences captured
using a setup of 8 synchronized cameras recording FullHD
footage at 50 fps. To this end, we recorded 9 sequences that
span a variety of hand-hand and hand-object interactions,
namely: “Praying”, “Fingertips Touching”, “Fingertips
Crossing”, “Fingers Crossing and Twisting”, “Fingers
Folding”, “Fingers Walking” on the back of the hand,
“Holding and Passing a Ball”, “Paper Folding” and “Rope
Folding”. The length of the sequences varies from 180 to
1500 frames.

Table 14 Pose estimation error for each sequence

Sequence                          Mean Error   Standard Deviation
Moving Ball, 1 hand               6.10         3.90
Moving Ball, 2 hands              7.15         4.82
Bending Pipe                      6.09         3.07
Bending Rope                      5.65         3.04
Moving Ball, 1 hand, occlusion    8.03         5.47
Moving Cube, 1 hand               4.68         2.61
Moving Cube, 1 hand, occlusion    5.55         3.28

Figure 21 shows one frame from each of the tested sequences
and the obtained results overlaid on the original frames
from two different cameras. Visual inspection reveals that
the proposed algorithm also works quite well for multiple
RGB cameras, even in challenging scenarios of very closely
interacting hands with multiple occlusions. The data and
videos are available online
(http://files.is.tue.mpg.de/dtzionas/hand-object-capture.html).


4.4.1 Component Evaluation


As for the RGB-D sequences, we also evaluate the components
of our approach. To this end, we synthesized two sequences:
first, fingers crossing and folding, and second, holding and
passing a ball, both similar to the ones captured in the
real scenario. Videos were generated using commercial
rendering software. The pose estimation accuracy was then
evaluated both in terms of error in the joint positions and
in terms of error in the bone orientations.

Table 15 shows a quantitative evaluation of the algorithm
performance with respect to the visual features used. Each
feature contributes to the accuracy of the algorithm, and
the salient points S clearly boost its performance. The
benefit of the salient points is larger than for the RGB-D
sequences since the localization of the finger tips from
several high-resolution RGB cameras is more accurate than
from a monocular depth camera with lower resolution. This is
also indicated by the precision-recall curves in Figure 13a.


We also compared with (Oikonomidis et al 2011a) on the
synthetic data, where we used our own implementation since
the publicly available source code requires a single RGB-D
sequence. We also added the salient points term and used two
settings, namely 64 and 128 particles over 40 generations.
The results in Table 15 show that our approach estimates the
pose with a lower error and confirm the results reported for
the RGB-D sequences.
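
For readers unfamiliar with the baseline, the following is a minimal sketch of a generic, textbook global-best particle swarm optimizer run with 64 or 128 particles over 40 generations. It is not the implementation of (Oikonomidis et al 2011a) nor the re-implementation used in the comparison: the inertia and attraction weights, the search bound and the toy cost function `sphere` are all assumptions made for illustration.

```python
import random


def pso_minimize(cost, dim, particles=64, generations=40, bound=5.0, seed=0):
    """Textbook global-best PSO; returns (best_position, best_cost_history)."""
    rng = random.Random(seed)
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia and attraction weights (assumed values)
    pos = [[rng.uniform(-bound, bound) for _ in range(dim)]
           for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    best_pos = [p[:] for p in pos]        # best position seen by each particle
    best_cost = [cost(p) for p in pos]    # cost at that position
    g = min(range(particles), key=best_cost.__getitem__)
    g_pos, g_cost = best_pos[g][:], best_cost[g]
    history = [g_cost]
    for _ in range(generations):
        for i in range(particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            c = cost(pos[i])
            if c < best_cost[i]:
                best_cost[i], best_pos[i] = c, pos[i][:]
                if c < g_cost:
                    g_cost, g_pos = c, pos[i][:]
        history.append(g_cost)
    return g_pos, history


def sphere(x):
    """Toy cost standing in for a pose-fitting energy."""
    return sum(v * v for v in x)


for n in (64, 128):
    _, hist = pso_minimize(sphere, dim=3, particles=n)
    print(n, "particles: best cost", hist[-1])
```

Unlike the local optimization, such a sampling-based optimizer needs no gradient of the energy, but its cost grows with the number of particles and generations.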

^ http://files.is.tue.mpg.de/dtzionas/hand-object-capture.html


Table 15 Quantitative evaluation of the algorithm
performance with respect to the used visual features: edges
E, collisions C, optical flow O, and salient points S. LO
stands for our local optimization approach, while HOPE64 and
HOPE128 stand for our implementation of (Oikonomidis et al
2011a) with 64 and 128 particles respectively, evaluated
over 40 generations.

Used features      Mean    St.Dev.   Max
LO + E             3.11    4.52      49.86
LO + EC            2.50    2.89      52.94
LO + ECO           2.38    2.25      16.84
LO + ECOS          1.49    1.44      13.27
HOPE64 + ECOS      4.86    3.69      31.05
HOPE128 + ECOS     4.67    3.28      41.11

Used features      Mean    St.Dev.   Max
LO + E             2.36    6.84      94.58
LO + EC            1.98    4.57      91.89
LO + ECO           1.84    3.81      60.09
LO + ECOS          1.88    3.90      44.51
HOPE64 + ECOS      4.35    7.11      58.61
HOPE128 + ECOS     4.73    7.46      78.65



[Plot axes: (a) salient point detection rate [%]; (b) number of iterations]

Fig. 17 Quantitative evaluation of the algorithm
performance on noisy data, with respect to the salient point
detection rate (a), and the number of iterations (b). Black
bars indicate the standard deviation of the obtained error.


In order to make the synthetic experiments as realistic as
possible, we simulated noise in all of the visual features.
More precisely, edge detection errors were introduced by
adding structural noise to the images, i.e. by adding and
subtracting at random positions in each image 100 circles of
radius varying between 10 and 30 pixels. The optical flow
features corresponding to those circles were also not
considered. Errors in the salient point detector were
simulated by randomly deleting detections as well as by
randomly adding outliers in a
Table 16 Results obtained on the manually marked data
for the multicamera RGB sequences. The table reports the
distance in mm between the manually tracked 3D points and
the corresponding vertices on the hand model. The figure
shows the positions of the tracked points on the hand.

Points     median   mean    std    max
point 1    6.98     7.98    3.54   20.53
point 2    11.14    12.28   5.22   23.48
point 3    10.91    10.72   4.13   24.68

and missing visual data. We performed qualitative and
quantitative evaluations on 8 sequences captured with
multiple RGB cameras and on 21 sequences captured with a
single RGB-D camera. Comparisons with an approach based on
particle swarm optimization (Oikonomidis et al 2011a) for
both camera systems revealed that our model achieves a
higher accuracy for hand pose estimation. For the first
time, we present successful tracking results of hands
interacting with highly articulated objects.


radius of 200 pixels around the actual features. Gaussian
noise of 5 pixels was further introduced on the coordinates
of the resulting salient points. Figure 17(a) shows the
influence of the salient point detector on the accuracy of
the pose estimation in case of noisy data. This experiment
was run with a salient point false positive rate of 10% and
with varying detection rates. The error quickly drops very
close to its minimum even with a detection rate of only 30%.
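
The salient point corruption described above can be sketched as follows. The text does not say how the 10% false positive rate is applied or how outliers are distributed inside the 200-pixel radius, so this sketch assumes that each true point spawns an outlier with that probability and that outliers are uniform over the disc. The function name and parameters are hypothetical.

```python
import math
import random


def corrupt_detections(points, detection_rate, false_positive_rate=0.10,
                       outlier_radius=200.0, sigma=5.0, rng=None):
    """Simulate an imperfect salient point detector on 2D pixel coordinates.

    Each true point is kept with probability `detection_rate`. With
    probability `false_positive_rate` (assumed to apply per true point) an
    outlier is added, uniformly distributed inside a disc of
    `outlier_radius` pixels around it. Gaussian noise with standard
    deviation `sigma` is then added to every resulting point.
    """
    rng = rng or random.Random()
    result = []
    for (x, y) in points:
        if rng.random() < detection_rate:        # keep the true detection
            result.append((x, y))
        if rng.random() < false_positive_rate:   # add a spurious detection
            r = outlier_radius * math.sqrt(rng.random())  # uniform over disc
            a = 2.0 * math.pi * rng.random()
            result.append((x + r * math.cos(a), y + r * math.sin(a)))
    return [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma))
            for (x, y) in result]


fingertips = [(100.0, 200.0), (300.0, 400.0), (500.0, 600.0)]
noisy = corrupt_detections(fingertips, detection_rate=0.3,
                           rng=random.Random(0))
```

Sweeping `detection_rate` from 0 to 1 while keeping the other parameters fixed reproduces the kind of experiment summarized in Figure 17(a).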

Figure 17(b) shows the convergence rate for different
numbers of iterations. The accuracy of the algorithm becomes
quite reasonable after just 10 to 15 iterations, which is
the same as for the RGB-D sequences.
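
The iteration counts relate to the stopping criterion used for the RGB-D sequences in Section 4.2.2: the average change of the joint positions after each iteration, with a threshold of 0.2 mm and at most 50 iterations. A minimal sketch of such a loop is given below. The `step` callable is a hypothetical placeholder for one iteration of the local optimization, and `halfway` is a toy update invented for the example.

```python
import math


def mean_joint_change(previous, current):
    """Average Euclidean displacement of the joints between two iterations."""
    total = sum(math.dist(p, c) for p, c in zip(previous, current))
    return total / len(current)


def optimize_pose(joints, step, threshold=0.2, max_iterations=50):
    """Apply `step` until the joints move less than `threshold` (mm) on
    average, or until `max_iterations` is reached.
    Returns (joints, iterations_used)."""
    for iteration in range(1, max_iterations + 1):
        updated = step(joints)
        change = mean_joint_change(joints, updated)
        joints = updated
        if change < threshold:
            break
    return joints, iteration


# Toy update standing in for one local-optimization iteration: every joint
# moves halfway towards a fixed target pose (illustrative only).
target = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]


def halfway(joints):
    return [tuple((a + b) / 2.0 for a, b in zip(j, t))
            for j, t in zip(joints, target)]


start = [(8.0, 0.0, 0.0), (0.0, 8.0, 0.0)]
pose, used = optimize_pose(start, halfway)  # stops once the change is < 0.2
```

With this toy update the average change halves at every iteration (4, 2, 1, 0.5, 0.25, 0.125 mm), so the loop stops after the sixth iteration.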

We also annotated one of the captured sequences for
evaluation. Since annotating joints in multiple RGB cameras
is more time consuming than annotating joints in a single
RGB-D camera, we manually labeled only three points on the
hands in all camera views of the sequence “Holding and
Passing a Ball”. Since we obtain 3D points by triangulation,
we use the 3D distance between these points and the
corresponding vertices in the hand model as error metric.
Table 16 shows the tracking accuracy obtained in this
experiment. Overall, the median of the tracking error is at
most about 1 cm (11.14 mm for point 2).
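
The error metric can be illustrated with the sketch below. The triangulation method is not specified in the text, so the example assumes the simplest case: the midpoint between two viewing rays, whereas the actual setup has 8 cameras. The choice of the population standard deviation is also an assumption, and all function names are hypothetical.

```python
import math
import statistics


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between two viewing rays.
    c1, c2: camera centres; d1, d2: ray directions (need not be unit)."""
    w = tuple(a - b for a, b in zip(c1, c2))
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel")
    s = (b * e - c * d) / denom   # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom   # parameter of the closest point on ray 2
    p1 = tuple(ci + s * di for ci, di in zip(c1, d1))
    p2 = tuple(ci + t * di for ci, di in zip(c2, d2))
    return tuple((u + v) / 2.0 for u, v in zip(p1, p2))


def error_statistics(annotated, model_vertices):
    """Median, mean, standard deviation and maximum of the 3D distances
    between annotated points and the corresponding model vertices."""
    dists = [math.dist(p, q) for p, q in zip(annotated, model_vertices)]
    return {"median": statistics.median(dists),
            "mean": statistics.fmean(dists),
            "std": statistics.pstdev(dists),   # population std (assumed)
            "max": max(dists)}


# Two cameras 2 units apart, both observing the point (1, 0, 5).
point = triangulate_midpoint((0, 0, 0), (1, 0, 5), (2, 0, 0), (-1, 0, 5))
```

Applying `error_statistics` to the triangulated annotations of one tracked point over all frames yields one row of the kind reported in Table 16.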
5 Conclusion

In this paper we have presented a framework that captures
articulated motion of hands and manipulated objects from
monocular RGB-D videos as well as multiple synchronized RGB
videos. Contrary to works that focus on gestures and single
hands, we focus on the more difficult case of intense
hand-hand and hand-object interactions. To address the
difficulties, we have proposed an approach that combines in
a single objective function a generative model with
discriminatively trained salient points, collision detection
and physics simulation. Although the collision and physics
terms reduce the pose estimation error only slightly, they
increase the realism of the captured motion, especially
under occlusions
(a) Fingers Walking
(b) Fingers Crossing
(c) Fingers Crossing and Twisting
(d) Fingers Dancing
(e) Fingers Hugging
(f) Fingers Grasping
(g) Rock Gesture
(h) Bunny Gesture

Fig. 18 Some of the obtained results. (Left) Input RGB-D image. (Center-Left) Obtained results overlaid on the input
image. (Center-Right) Obtained results fitted in the input point cloud. (Right) Obtained results from another viewpoint.
(a) “Moving a Cube” with occluded manipulating finger, Frame 083
(b) “Moving a Cube” with occluded manipulating finger, Frame 106
(c) “Moving a Cube” with occluded manipulating finger, Frame 125
(d) “Moving a Cube”, Frame 085
(e) “Moving a Ball” with 2 hands, Frame 113
(f) (Left) “Bending a Pipe”, Frame 159. (Right) “Bending a Rope”, Frame 159

Fig. 19 The impact of the physics component. For each image pair, the left image corresponds to LO + SC and the right
one to LO + SCP. In the case of missing or ambiguous input visual data, as in sequences with an occluded manipulating finger,
the contribution of the physics component towards more physically plausible poses becomes more prominent.
(a) “Moving a Ball” with 1 hand (new sequence)
(d) “Bending a Rope” (new sequence)
(e) “Moving a Ball” with occluded manipulating finger (new sequence)
(f) “Moving a Cube” (new sequence)
(g) “Moving a Cube” with occluded manipulating finger (new sequence)

Fig. 20 Some of the obtained results. (Left) Input RGB-D image. (Center-Left) Obtained results overlaid on the input
image. (Center-Right) Obtained results fitted in the input point cloud. (Right) Obtained results from another viewpoint.
(a) “Praying”
(b) “Finger Tips Touching”
(c) “Fingers Crossing”
(d) “Fingers Crossing and Twisting”
(e) “Fingers Folding”
(f) “Fingers Walking”
(g) “Holding and Passing a Ball”
(h) “Paper Folding” (new sequence)
(i) “Rope Folding” (new sequence)

Fig. 21 Some of the obtained results. (Left) One of the input RGB images. (Center) Obtained results overlaid on the input
image. (Right) Obtained results from another viewpoint.







